feat: Various improvements to memory pool configuration, logging, and documentation #2538
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##               main    #2538    +/-  ##
============================================
+ Coverage     56.12%   59.17%   +3.04%
- Complexity      976     1458     +482
============================================
  Files           119      146      +27
  Lines         11743    13685    +1942
  Branches       2251     2363     +112
============================================
+ Hits           6591     8098    +1507
- Misses         4012     4360     +348
- Partials       1140     1227      +87
```

View full report in Codecov by Sentry.
```scala
// The allocator thinks the exported ArrowArray and ArrowSchema structs are not released,
// so it will report:
// Caused by: java.lang.IllegalStateException: Memory was leaked by query.
// Memory leaked: (516) Allocator(ROOT) 0/516/808/9223372036854775807 (res/actual/peak/limit)
// This seems to be a false-positive leak, because profiling shows no memory leak at the
// JVM. `allocator` reports a leak because it tracks the accumulated amount of memory
// allocated for ArrowArray and ArrowSchema, but these exported structs are released later
// on the native side.
// To clarify further: for ArrowArray and ArrowSchema, Arrow puts a release field into the
// memory region, a callback function pointer (a C function) that can be called to release
// these structs from native code as well. Once we wrap their memory addresses on the
// native side using FFI ArrowArray and ArrowSchema, and drop them later, the callback is
// invoked to release the memory.
// But on the JVM, the allocator doesn't know about this, so it still keeps the
// accumulated count.
// Manually calling `release` and `close` makes the allocator happy, but causes a JVM
// runtime failure.

// allocator.close()
```
I removed this comment since it refers to an allocator that no longer exists in this code.
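The removed comment describes the Arrow C Data Interface lifecycle: exported structs carry a release callback that the native side invokes later, so a JVM-side allocator that only tracks producer allocations reports a false-positive leak. The accounting mismatch can be illustrated with a minimal, hypothetical Java sketch (this is not the actual Arrow Java API; all names are invented):

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified illustration (not the real Arrow Java API): an exported struct
// carries a release callback that the consumer (the native side, in Comet's
// case) invokes later. An accountant that only counts producer-side
// allocations still reports the bytes as outstanding in the meantime.
public class ReleaseCallbackSketch {
    // Hypothetical accountant standing in for Arrow's BufferAllocator.
    static final AtomicLong allocated = new AtomicLong();

    interface Release { void run(); }

    static class ExportedStruct {
        final long bytes;
        final Release release; // callback the consumer calls when done

        ExportedStruct(long bytes) {
            this.bytes = bytes;
            allocated.addAndGet(bytes);
            this.release = () -> allocated.addAndGet(-bytes);
        }
    }

    public static void main(String[] args) {
        ExportedStruct array = new ExportedStruct(516);
        // Checked before the consumer releases, the struct looks leaked.
        System.out.println("outstanding before release: " + allocated.get());
        // Later, the consumer (e.g. a native FFI drop) runs the callback.
        array.release.run();
        System.out.println("outstanding after release: " + allocated.get());
    }
}
```

The point of the sketch is only the ordering: the "leak" check happens before the consumer's callback fires, which matches the false positive described in the comment.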
```scala
"Only applies to off-heap mode. " +
s"$TUNING_GUIDE.")
.doubleConf
.createWithDefault(1.0)
```
Default is 1.0 so that this change is not a breaking change
```scala
object CometExecIterator extends Logging {

  def getMemoryConfig(conf: SparkConf): MemoryConfig = {
    val numCores = numDriverOrExecutorCores(conf).toFloat
```
can number of cores be fractional? 🤔
No, this is an intConf. I updated this.
```scala
  def getMemoryConfig(conf: SparkConf): MemoryConfig = {
    val numCores = numDriverOrExecutorCores(conf).toFloat
    val coresPerTask = conf.get("spark.task.cpus", "1").toFloat
```
same here
No, this is an intConf. I updated this.
```scala
    if (threads == "*") Runtime.getRuntime.availableProcessors() else threads.toInt
  }

  val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r
```
It would be nice to add a comment explaining what this expression is looking for, e.g. `local[*]`, perhaps with pseudocode?
I added some comments
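For reference, what `LOCAL_N_REGEX` accepts can be sketched in plain Java (a hedged illustration, not the Comet implementation): a master string `local[N]` yields `N` threads, while `local[*]` means all available processors.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration of the pattern from the diff: matches "local[<digits>]"
// or "local[*]", capturing the thread count (or "*") as group 1.
public class LocalMasterSketch {
    static final Pattern LOCAL_N = Pattern.compile("local\\[([0-9]+|\\*)\\]");

    // Hypothetical helper showing how the captured group would be used.
    static int parseCores(String master) {
        Matcher m = LOCAL_N.matcher(master);
        if (!m.matches()) throw new IllegalArgumentException("not a local[N] master: " + master);
        String threads = m.group(1);
        return threads.equals("*")
            ? Runtime.getRuntime().availableProcessors() // "*" = all cores
            : Integer.parseInt(threads);
    }

    public static void main(String[] args) {
        System.out.println(parseCores("local[4]")); // prints 4
        System.out.println(parseCores("local[*]")); // machine-dependent
    }
}
```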
|
@parthchandra @comphead @mbutrovich This is now ready for review. There have been some changes in scope today, so please re-read the PR description.
```scala
"The type of memory pool to be used for Comet native execution. " +
"When running Spark in on-heap mode, available pool types are 'greedy', 'fair_spill', " +
"The type of memory pool to be used for Comet native execution " +
"hen running Spark in on-heap mode. Available pool types are 'greedy', 'fair_spill', " +
```
Minor nitpick: I guess "hen" is a typo for "when"? (Probably no caps in "available" either?)
Yes, thanks. 🐔
coderfender left a comment:
Minor typo
```scala
.doc(
  "The type of memory pool to be used for Comet native execution " +
  "when running Spark in off-heap mode. Available pool types are 'greedy', 'fair_spill', " +
  "'greedy_task_shared', 'fair_spill_task_shared', 'greedy_global', 'fair_spill_global', " +
```
Are we not limiting the available memory pools?
I think we should, but as a separate PR
oops .. this was actually a copy-paste error - these are the on-heap pools. Updated.
Thanks @andygrove, I think the PR is good; it's waiting for
> When running in on-heap mode, Comet will use its own dedicated memory pools that are not shared with Spark.
> `fair_unified_global` allows any task to use the full off-heap memory pool.
Where is fair_unified_global used? I don’t seem to find it in the code.
Thanks. I have updated this.
```scala
val offHeapSize = ByteUnit.MiB.toBytes(conf.getSizeAsMb("spark.memory.offHeap.size"))
val memoryFraction = CometConf.COMET_EXEC_MEMORY_POOL_FRACTION.get()
val memoryLimit = (offHeapSize * memoryFraction).toLong
val memoryLimitPerTask = (memoryLimit.toFloat * coresPerTask / numCores).toLong
```
We should be able to use toDouble instead of toFloat here. I'm not super worried about rounding errors or overflow here, but better safe than sorry and we won't see a performance difference.
```scala
val memoryLimit = CometSparkSessionExtensions.getCometMemoryOverhead(conf)
// example 16GB maxMemory * 16 cores with 4 cores per task results
// in memory_limit_per_task = 16 GB * 4 / 16 = 16 GB / 4 = 4GB
val memoryLimitPerTask = (memoryLimit.toFloat * coresPerTask / numCores).toLong
```
Same comment about toDouble.
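The per-task arithmetic in the snippet above, rewritten with `double` as the review suggests, can be sketched as follows (a hypothetical helper, not the Comet code):

```java
// Sketch of memoryLimitPerTask = memoryLimit * coresPerTask / numCores,
// using double (per the review) instead of float for the intermediate value.
public class MemoryPerTaskSketch {
    static long memoryLimitPerTask(long memoryLimit, int coresPerTask, int numCores) {
        return (long) ((double) memoryLimit * coresPerTask / numCores);
    }

    public static void main(String[] args) {
        long sixteenGb = 16L * 1024 * 1024 * 1024;
        // 16 GB * 4 / 16 = 4 GB, matching the inline comment in the diff.
        System.out.println(memoryLimitPerTask(sixteenGb, 4, 16)); // prints 4294967296
    }
}
```

Using `double` keeps 52 bits of mantissa versus `float`'s 23, so byte counts up to several petabytes survive the multiply/divide without rounding loss.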
mbutrovich left a comment:
This looks like a huge improvement in configuring Comet! Thanks @andygrove!
> Comet Performance
>
> - Comet requires at least 5 GB of RAM in off-heap mode and 6 GB RAM in on-heap mode, but performance at this level
> - Comet requires at least 5 GB of RAM, but performance at this level
is it on-heap or off-heap?
comphead left a comment:
Thanks @andygrove
Which issue does this PR close?
N/A
Rationale for this change
Now that Comet requires off-heap mode (except for when we are running tests), remove all of the confusing documentation about configuring Comet on-heap memory pools.
Also, add a new config for controlling what percentage of the off-heap memory pool can be used by Comet (required because Comet memory accounting is not accurate).
With this PR, there are now only two user-facing memory pool configs:

- `spark.comet.exec.memoryPool`
- `spark.comet.exec.memoryPool.fraction`

What changes are included in this PR?

- `COMET_EXEC_MEMORY_POOL_FRACTION` for limiting the percentage of the off-heap pool that Comet can use (required because Comet memory accounting is not accurate)

How are these changes tested?
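How `spark.comet.exec.memoryPool.fraction` bounds the Comet pool can be sketched as follows (a hedged illustration of the `memoryLimit = offHeapSize * memoryFraction` computation shown in the diff; the helper name is invented):

```java
// Sketch of the fraction-based pool limit from the diff:
//   memoryLimit = (offHeapSize * memoryFraction).toLong
public class PoolFractionSketch {
    static long cometPoolBytes(long offHeapBytes, double fraction) {
        return (long) (offHeapBytes * fraction);
    }

    public static void main(String[] args) {
        long offHeap = 8L * 1024 * 1024 * 1024; // e.g. spark.memory.offHeap.size=8g
        // Default fraction 1.0 leaves the limit unchanged (not a breaking change).
        System.out.println(cometPoolBytes(offHeap, 1.0)); // prints 8589934592
        // A smaller fraction leaves headroom for inaccurate memory accounting.
        System.out.println(cometPoolBytes(offHeap, 0.8));
    }
}
```

The default of 1.0 preserves existing behavior, matching the reviewer's note that this change is not breaking; operators can lower the fraction to keep a safety margin.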